Introduction

A picture is worth a thousand words; when presenting and interpreting data this basic idea also applies. There has been, indeed, a growing shift in data analysis toward more visual approaches to both interpretation and dissemination of numerical analysis. Part of the new data revolution consists in the mixing of ideas from visualisation of statistical analysis and visual design. Indeed data visualisation is one of the most interesting areas of development in the field.

Good graphics not only help researchers to make their data easier to understand by the general public. They are also a useful way for understanding the data ourselves. In many ways it is very often a more intuitive way to understand patterns in our data than trying to look at numerical results presented in a tabular form.

Recent research has revealed that papers which have good graphics are perceived as overall more clear and more interesting, and their authors perceived as smarter (see this presentation)

As with other aspects of R, there are a number of core functions that can be used to produced graphics. For example, we’ve already used plot(). However these offer limited possibilities for building graphs, and it is by exploring packages that are developed especially for graphing.

The package we will be using throughout this tutorial is ggplot2. The aim of ggplot is to implement the grammar of graphics. The ggplot2 package has excellent online documentation.

If you don’t already have the package installed, you will need to do so using the install.packages() function.

You will then need to load up the package

library(ggplot2)

The grammar of graphics defines various components of the graphic. Some of the most important are:

-The data: For using ggplot2 the data has to be stored as a data frame

-The geoms: They describe the objects that represent the data (e.g., points, lines, polygons, etc.).

-The aesthetics: They describe the visual characteristics that represent data (e.g., position, size, colour, shape, transparency).

-Facets: They describe how data is split into subsets and displayed as multiple small graphs.

-Stats: They describe statistical transformations that typically summarise data.

Anatomy of a plot

Essentially the philosophy behind this as that all graphics are made up of layers. Ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same few components: a data set, a set of geoms—visual marks that represent data points, and a coordinate system.

Take this example (all taken from Wickham, H. (2010). A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1), 3-28.)

You have a table such as:

You then want to plot this. To do so, you want to create a plot that combines the following layers:

This will result in a final plot:

Let’s have a look at what this looks like for a graph.

Let’s have a look at this with some data. In particular, we can look at some Home Office statistics on football related arrests and banning orders in England and Wales.

To make things easier, I have a cleaned version of this data saved. If you are interested to read about how this clean version was made fromm the original, you can have a look at the lab session for the undergraduate class (part of introductory data course) where I teach about data cleaning in Lesson 1 here.

But for now, let’s read in the data. To do this in R, we use the read.csv() function. We create the fbo object, which will hold the data.

To create an object in R, you have to give it a name (this can be anything, but usually good to keep it descriptive, and call it fbo_data instead of like dataset123) and then use the assignment operator (or “arrow thing”) which looks like this: <- to give it a value. So to create an object, called “my_name” and give it the value “bob” we would use the following code:

my_name <- "Bob"

Then, whenever you want to call this object, you just have to type it’s name. So if I wanted to use the tolower() function to transform the name “Bob” to lowercase, I can just use the object:

tolower(my_name)
[1] "bob"

Now we can use the same approach to create the fbo object with the read.csv() function, and inside the function we have to put a path for the data to be reached. In this case, the data can be reached from here: https://raw.githubusercontent.com/rcatlord/GSinR_materials/master/6_visualise/fbo-by-club-supported-cleaned.csv. So this is the value we put in the read.csv() function.

fbo <- read.csv("https://raw.githubusercontent.com/rcatlord/GSinR_materials/master/6_visualise/fbo-by-club-supported-cleaned.csv")

This has created this fbo object. If you wanted to have a look at it, you can use the head() function to print the first 6 lines, or the View() function to view it all.

head(fbo)
    Club.Supported Banning.Orders League.of.the.Club.Supported
1         Arsenal              60               Premier League
2      Aston Villa             57               Premier League
3  Birmingham City             25               Premier League
4 Blackburn Rovers             35               Premier League
5        Blackpool             22               Premier League
6 Bolton Wanderers             35               Premier League

Now let’s look at different number of banning orders for clubs in different leagues. As a first step, let’s just plot the number of banning orders for each club. Let’s build this plot:

ggplot(data = fbo, aes(x = Club.Supported, y=Banning.Orders)) +          #data
   geom_point() +                           #geometry
  theme_bw()                                    #backgroud coordinate system

The first line above begins a plot by calling the ggplot() function, and putting the data into it. You have to name your dataframe, and then, within the aes() command you pass the specific variables which you want to plot. In this case, we only want to see the distribution of one variable, crime.

The second line is where we add the geometry. This is where we tell R what we want the graph to be. Here we say we want it to be points. You can see a list of all possible geoms here.

The third line is where we can tweak the display of the graph. Here I used theme_bw() which is a nice clean theme. You can try with other themes. To get a list of themes you can also see the resource here.

ggplot(data = fbo, aes(x = Club.Supported, y=Banning.Orders)) +     #data
   geom_point() +                                                   #geometry
  theme_dark()                                                      #backgroud coordinate system

Changing the theme is not all you do with the third element. For example here you can’t really read the axis labels, because they’re all overlapping. One solution would be to rotate your axis labels 90 degrees, with the following code: axis.text.x = element_text(angle = 90, hjust = 1). You pass this code to the theme argument.

ggplot(data = fbo, aes(x = Club.Supported, y = Banning.Orders)) + geom_point() + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

OK what if we don’t want it to be points, but instead we wanted it to be a bar graph?

ggplot(data = fbo, aes(x = Club.Supported, y=Banning.Orders)) +   #data
   geom_bar(stat = "identity") +                                  #geometry
  theme(axis.text.x = element_text(angle = 90, hjust = 1))                                                       #backgroud coordinate system

You might notice here we pass an argument stat = "identity" to geom_bar() function. This is because you can have a bar graph where the height of the bar shows frequency (stat = “count”), or where the height is taken from a variable in your dataframe (stat = “identity”). Here we specified a y-value (height) as the Banning.Orders variable.

So this is cool! But what if I like both?

Well this is the beauty of the layering approach of ggplot2. You can layer on as many geoms as your heart desires!

ggplot(data = fbo, aes(x = Club.Supported, y=Banning.Orders)) +  #data
  geom_bar(stat = "identity") +                                  #geometry 1 
  geom_point()+                                                  #geometry 2
theme(axis.text.x = element_text(angle = 90, hjust = 1))                                                      #backgroud coordinate system

You can add other things too. For example you can add the mean number of Banning Orders as a line:

ggplot(data = fbo, aes(x = Club.Supported, y=Banning.Orders)) +  #data
  geom_bar(stat = "identity") +                                  #geometry 1 
  geom_point() +                                                 #geometry 2
  geom_hline(yintercept = mean(fbo$Banning.Orders)) +            #mean line
theme(axis.text.x = element_text(angle = 90, hjust = 1))                                                      #backgroud coordinate system

This is basically all you need to know to build a graph!

But remember all we learned about good design. So what do we want to be able to say here? Probably, we want to talk about which teams have the highest numbers of banning orders. Let’s stick with the bar chart for this:

ggplot(data = fbo, aes(x = Club.Supported, y=Banning.Orders)) +  #data
  geom_bar(stat = "identity") +                                  #geometry 1  
  theme_minimal() +                                              #backgroud coordinate system
  theme(axis.text.x = element_text(angle = 90, hjust = 1))       #flip x axis labels

The text here is still a bit jumbled. Let’s make sure that we can read everything.

ggplot(data = fbo, aes(x = Club.Supported, y=Banning.Orders)) +  #data
  geom_bar(stat = "identity") +                                  #geometry 1  
  theme_minimal() +                                              #backgroud coordinate system
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 6))       #add size = 6 to make text smaller

Reading sideways is weird, and the font is tiny. Another option is to flip the plot with coord_flip():

ggplot(data = fbo, aes(x = Club.Supported, y=Banning.Orders)) +  #data
  geom_bar(stat = "identity") +                                  #geometry 1  
  theme_minimal() +                                              #backgroud coordinate system
  coord_flip()

Much better, but we could still tell a better story. We care about the order here. So it would make sense to sort the clubs by the number of banning orders. To do this, we can just use the reorder() function, and specify what we want to order, and by what variable. Like so:

ggplot(data = fbo, aes(x = reorder(Club.Supported, Banning.Orders), y=Banning.Orders)) +  #data
  geom_bar(stat = "identity") +                                  #geometry 1  
  theme_minimal() +                                              #backgroud coordinate system
  coord_flip()

Alright, now we can move on to exploring different types of charts in R.

What graph should I use?

There are a lot of points to consider when you are choosing what graph to use to visually represent your data. There are some best practice guidelines, but at the end of the day, you need to consider what is best for your data. What do you want to show? What graph will best communicate your message? Is it a comparison between groups? Is it the frequency distribution of 1 variable?

As some guidance, you can use the below cheatsheet, taken from Nathan Yau’s blog Flowingdata:

However, keep in mind that this is more of a guideline, aimed to nudge you in the right direction. There are many ways to visualise the same data, and sometimes you might want to experiment with some of these, see what the differences are.

There is also a vast amount of research into what works in displaying quantitative information. The classic book is The Visual Dispay of Quantitative Information by Edward Tufte, but since him there are many other researchers as well who focus on approaches to displaying data.

They tend to result in recommendations on what to use (and not use) in certain contexts.

For example, most data visualisation experts agree that you should not use 3D graphics unless there is a meaning to the third dimension. So using 3D graphics just for decoration, as in this case is normally frowed upon. However there are cases when including a third dimension is vital to communicating your findings. See this example.

Also often certain chart types are villanified. For example, the pie chart is one such example. A lot of people really dislike piecharts, eg see here or here. If you want to display proportion, research indicates that a square pie chart is more likely to be interpreted correctly by viewers: see here

In any case, the plot that you use depends on the data you are plotting, as well as the message you want to convey with the plot.

So for example, returning again to the difference between number of banning orders between clubs in different leagues, what are some ways of plotting these?

One suggestion is to make a histogram for each one. You can use ggplot’s facet_wrap() option to split graphs by a grouping variable. For example, to create a histogram of banning orders you write:

ggplot(data = fbo, aes(x = Banning.Orders)) + geom_histogram()

Now to split this by League.of.the.Club.Supported, you use facet_wrap() in the coordinate layer of the plot.

ggplot(data = fbo, aes(x = Banning.Orders)) + geom_histogram() + facet_wrap(~League.of.the.Club.Supported)

Well you can see there’s different distribution in each league. But is this easy to compare? Maybe another approach would make it easier?

Personally I like Box plots for showing distribution. So let’s try:

ggplot(data = fbo, aes(x = League.of.the.Club.Supported, y = Banning.Orders)) + 
    geom_boxplot()

This makes the comparison significantly easier, right? But the order is strange! When R does not know the order in which to display a variable, it will arrange them alphabetically. But often our variables will have a natural order. In this case, there is an order to the leagues. But R does not know this, unless we tell it. We can tell it using the factor() function.

fbo$League.of.the.Club.Supported <- factor(fbo$League.of.the.Club.Supported, 
    levels = c("Premier League", "Championship", "League One", "League Two", 
        "Other clubs"))

And now create the plot again!

ggplot(data = fbo, aes(x = League.of.the.Club.Supported, y = Banning.Orders)) + 
    geom_boxplot()

Now this is great! We can see that the higher the league the more banning orders they have. Any ideas why?

We’ll now go through some examples of making graphs using ggplot package

ggplot

As mentioned earlier, we will emphasise in this course the use of the ggplot() function. With ggplot() you start with a blank canvass and keep adding specific layers. The ggplot() function can specify the dataset and the aesthetics (the visual characteristics that represent the data).

To get the data we’re going to use here, load up the package “gapminder”.

library(gapminder)

This package has a dataframe called “gapminder”. This is an excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country. To access the codebook (how you find out what variables are) use the “?”.

filter()

This dataframe has data from 1952 to 2007, but here we want to focus on just the most recent year. So let’s select only data from 2007. We can use the filter() function to achieve this. This function is from the dplyr package, so make sure you have loaded it with the library(dplyr) command. Inside, we have to specify what to filter, and the condition that must be met for something to be selected from it. In this case, we are filtering the gapminder dataset, and the condition we want the items we keep to meet is that the year is 2007 (year == 2007).

gapminder_2007 <- filter(gapminder, year == 2007)

OK so let’s make a graph about the variable which represents the life expectancy by country (‘lifeExp’).

Scatter plots with two variables

When looking at the relationship between two quantitative variables nothing beats the scatterplot. This is a lovely article in the history of the scatterplot!

A scatterplot plots one variable in the Y axis, and another in the X axis. Typically, if you have a clear outcome or response variable in mind, you place it in the Y axis, and you place the explanatory variable in the X axis.

This is how you produce a scatterplot with ggplot():

# A scatterplot of crime versus median value of the properties
ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) + geom_point()

Each point represents a case in our dataset and the coordinates attached to it in this two dimensional plane are given by their value in the Y (crime) and X (median value of the properties) variables.

One of the things you may notice with a scatterplot is that even with a smallish dataset such as this, with just about 142 cases, overplotting may be a problem. When you have many cases with similar (or even worse the same) value, it is difficult to tell them apart. There’s a variety of ways of dealing with overplotting. One possibility is to add some transparency to the points:

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) + geom_point(alpha = 0.4)  #you will have to test different values for alpha

Overplotting can occur when a continuous measurement is rounded to some convenient unit. This has the effect of changing a continuous variable into a discrete ordinal variable. For example, age is measured in years and body weight is measured in pounds or kilograms. Age is a discrete variable, it only takes integer values. That’s why you see the points lined up in parallel vertical lines. This also contributes to the overplotting in this case.

One way of dealing with this particular problem is by jittering. Jittering is the act of adding random noise to data in order to prevent overplotting in statistical graphs. In ggplot one way of doing this is by passing an argument to geom_point specifying you want to jitter the points. This will introduce some random noise so that age looks less discrete.

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) + geom_point(alpha = 0.2, 
    position = "jitter")  #Alternatively you could replace geom_point() with geom_jitter() in which case you don't need to specify the position

Another alternative for solving overplotting is to bin the data into rectangles and map the density of the points to the fill of the colour of the rectangles.

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) + stat_bin2d()

#The same but with nicer graphical parameters
ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) +
  stat_bin2d(bins=50) + #by increasing the number of bins we get more granularity
  scale_fill_gradient(low = "lightblue", high = "red") #change colors

What this is doing is creating boxes within the two dimensional plane; counting the number of points within those boxes; and attaching a colour to the box in function of the density of points within each of the rectangles.

When looking at scatterplots, sometimes it is useful to summarise the relationships by mean of drawing lines. You could for example add a line representing the conditional mean. A conditional mean is simply mean of your Y variable for each value of X. We can ask R to plot a line connecting these means using geom_line() and specifying you want the conditional means.

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) + geom_point(alpha = 0.4) + 
    geom_line(stat = "summary", fun.y = mean)

With only 142 cases there are loads of ups and downs. If you have many more cases for each level of X the line would look less rough. You can, in any case, produce a smoother line using geom_smooth instead. This is computed using the “loess” method. A Loess regression line subsets chunks of data around your X axis to try to estimate a regression line that fits well a region of the data.

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) + geom_point(alpha = 0.4) + 
    geom_smooth(colour = "red", size = 1, se = FALSE)  #The se argument asks whether to display the confidence intervals around smooth, colour is simply asking for a red line instead of blue, and size makes the line a bit thicker with size 1)

As you can see here you produce a smoother line than with the conditional means. The line, as the scatterplot, seems to be suggesting an overall curvilinear relationship that almost flattens out once the GDP is around 30,000.

Scatter plots conditioning in a third variable

There are various ways to plot a third variable in a scatterplot. You could go 3D and in some contexts that may be appropriate. But more often than not it is preferable to use only a two dimensional plot.

If you have a grouping variable you could map it to the colour of the points as one of the aesthetics arguments. Here we return to the Boston scatterplot but will add a third variable, that indicates whether the town is located by the river or not.

# Scatterplot with two quantitative variables and a grouping variable, we
# are telling R to tell 'continent', as a factor.
ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, colour = continent)) + 
    geom_point()

We can see distinct differences between continents.

As before you can add smooth lines to capture the relationship. What happens now, though, is that ggplot will produce a line for each of the levels in the categorical variable grouping the cases:

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point(alpha=.4) + #I am doing the points semi-transparent to see the lines better
  geom_smooth(se=FALSE, size=1) #I am doing the lines thicker to see them better

You can see how the relationship between life expectancy and GDP is more marked for different continents.

We can also map a quantitative variable to the colour aesthetic. When we do that, instead of different colours for each category we have a gradation in colour from darker to lighter depending on the value of the quantitative variable. Below we display the relationship between life expectancy and gdp conditioning on the size of the population of the country.

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, colour = pop)) + geom_point()

You could map the third variable to a different aesthetic (rather than colour). For example, you could map lstat to size of the points. This is called a bubblechart. The problem with this, however, is that it can make overplotting more acute sometimes.

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, size = pop)) + geom_point()  #You may want to add alpha for some transparency here.

If you have larger samples and the patterns are not clear, conditioning in a third variable can produce hard to read scatterplots (even if you use transparencies and jittering). In these cases, we could try to use facets instead using facet_wrap().

ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) + geom_point(alpha = 0.4, 
    position = "jitter") + facet_wrap(~continent)

Bar charts

You may be wondering what about categorical data? So far we have only discussed various visualisations where at least one of your variables is quantitative. When your variable is categorical you can use bar plots. We map the factor variable in the aesthetics and then use the geom_bar() function to ask for a bar chart.

mutate()

Often we visualise data because we want to compare distributions. Most of data analysis is about making comparisons. We are going to explore whether the life expectancy is different for less affluent countries. The variable gdpPercap measures the country’s GDP per capita. For the purposes of this illustration I want to dichotomise this variable. Here we make use of the mutate() function to create a new variable with an ifelse() condition. Here we will be grouping anything that’s got GDP higher than the mean as “high gdp”, and everything else as “low gdp”.

gapminder_2007 <- gapminder_2007 %>% mutate(gdp_factor = if_else(gdpPercap > 
    mean(gdpPercap), "high gdp", "low gdp"))

We also used the piping operator above. This is the %>% function. This pastes together functions, to make it easier to do more things in one bit of code. If you don’t want to type it out each time, you can use the CTRL + SHIFT + M (or CMD + SHIFT + M for mac) shortcut.

We also want to make sure that this variable is a factor. The variable we created was a character vector, this step transforms it into a factor (many functions designed to work with categorical variables expect a factor as an input, not just a character vector)

gapminder_2007$gdp_factor <- as.factor(gapminder_2007$gdp_factor)

Now we can plot this new variable we’ve created.

ggplot(gapminder_2007, aes(x = gdp_factor)) + geom_bar()

Unfortunately, the levels in this factor are ordered by alphabetical order, which is not always what we want. We can modify this by reordering the factors levels first -click here for more details. You could do this within the ggplot function (just for the visualisation), but in real life you would want to sort out your factor levels in an appropriate manner more permanently. As discussed above, this is the sort of thing you do as part of pre-processing your data. And then plot.

# Print the original order
print(levels(gapminder_2007$gdp_factor))
[1] "high gdp" "low gdp" 
# Reordering the factor levels. Notice that I am creating a new variable, it
# is often not unwise to do this to avoid messing up your original data.
gapminder_2007$gdp_factor_2 <- factor(gapminder_2007$gdp_factor, levels = c("low gdp", 
    "high gdp"))
# Plotting the variable again (and subsetting out the NA data)
ggplot(gapminder_2007, aes(x = gdp_factor_2)) + geom_bar()

We can also map a second variable to the aesthetics, for example, let’s look at victimisation in relation to feelings of safety. For this we produce a stacked bar chart.

ggplot(gapminder_2007, aes(x = continent, fill = gdp_factor_2)) + geom_bar()

These sort of stacked bar charts are not terribly helpful if you are interested in understanding the relationship between these two variables. Instead what you want is a proportional stacked bar chart, that gives you the proportion of your “explanatory variable” (here victimisation in the past 12 months) within each of the levels of your “response variable” (here feelings of safety).

First we need to scale the data to 100% within each stack and then plot. The code here is also more complex. Again, I’m not expecting you to fully understand all of it. In your early days of using R, you will find yourself cutting and pasting chunks of code that you will only fully understand with practice.

library(scales)  #load the scales library

ggplot(gapminder_2007, aes(x = continent, fill = gdp_factor_2)) + geom_bar(position = "fill") + 
    scale_y_continuous(labels = percent_format())

Here we can see which continents have higher percentage of the countries in high GDP category.

Sometimes, you may want to flip the axis, so that the bars are displayed horizontally. You can use the coord_flip() function for that.

# First we invoke the plot we created and stored earlier and then we add an
# additional specification with coord_flip()
ggplot(gapminder_2007, aes(x = continent, fill = gdp_factor_2)) + geom_bar(position = "fill") + 
    scale_y_continuous(labels = percent_format()) + coord_flip()

You can also use coord_flip() with other ggplot plots (e.g., boxplots).

A particular type of bar chart is the divergent stacked bar chart, often used to visualise Likert scales. You may want to look at some of the options available for it via the HH package, the likert package or sjPlot. But we won’t cover them here in detail.

Keep in mind that knowing how to get R to produce a particular visualisation is only half the job. The other half is knowing when to produce a particular kind of visualisation. This blog, for example, discusses some of the problems with stacked bar charts and the exceptional circumstances in which you may want to use them.

There are other tools sometimes used for visualising categorical data. Increasingly popular are mosaic plots. R is pretty good at doing them. However, we have covered already a lot of ground, plus I have some sympathy for Stephen Few’s arguments against mosaic plots.

What I would use instead are waffle charts. They’re super easy to make with the “waffle” package.

Here is an example using some Brexit data:

library(waffle)

We will specify colours with their hex value.

t <- c(Remain = 35, Leave = 37, `Didn't vote` = 28)

waffle(t, rows = 10, colors = c("#a6cee3", "#1f78b4", "#33a02c"), title = "Brexit results", 
    xlab = "1 square = 1%")

You can read more about them here.

Histograms

Histograms are useful ways of representing quantitative variables visually.

If you want to produce a histogram with the ggplot function you would use the following equivalent code:

ggplot(gapminder_2007, aes(x = lifeExp)) + #The "aes" argument defines the aesthetics for the plot, here we are telling R the data we will represent comes from the gapminder data and our X variable, displayed in the X axis, will be life expectancy. So we are specfying we are using the "lifeExp" variable to display a position in the X axis.
  geom_histogram() #If we just run the function in the first line we won't be plotting anything, we have to gell R what type of object, or geom, is going to represent the data that we specified in the aesthetics. There are multiple geoms we can use. For a histogram we use the geom_histogram.

So you can see that ggplot works in a way that you can add a series of additional specifications (layers, annotations). In this simple plot the ggplot function simply maps lifeExp as the variable to be displayed (as one of the aesthetics) and the dataset. Then you add the geom_histogram to tell R that you want this variable to be represented as a histogram. Later we will see what other things you can add.

A histogram is simply putting cases in “bins” and then creates a bar for each bin. You can think of it as a visual grouped frequency distribution. The code we have used so far has used a bin-width of size range/30 as R kindly reminded us. But you can modify this parameter if you want to get a rougher or a more granular picture. In fact, you should always play around with different specifications of the bin-width until you find one that tells the full story in a parsimonious way.

ggplot(gapminder_2007, aes(x = lifeExp)) + geom_histogram(binwidth = 1)  #We can pass arguments to the the geoms, here we are changing the size of the bins (for further details on other arguments you can check the help files)

Using bin-width of 1 we are essentially creating a bar for every one unit increase in the life expectancy. We can see that most countries have life expectancy of 60 or over.

# Let's sum the number of countries with a value of 60 or over in life
# expectancy
sum(gapminder_2007$lifeExp >= 60)
[1] 99

We can see that the large majority of countries, 99 out of 142, have a life expectancy over 60. But we can also see that there are some countries that have a low life expectancies. You can see how we can use visualisations to show the data and get a first feeling for how it may be distributed.

Now we can produce the plot:

ggplot(gapminder_2007, aes(x = lifeExp)) + geom_histogram(binwidth = 1) + facet_grid(gdp_factor ~ 
    .)  #Facets are another element of the grammar of graphics, we use it to define subsets of the data to be represented as multiple groups, here we are asking R to produce two plots defined by the two levels of the factor we just created.

Visually this may not look great, but it begins to tell a story. We can see that there is a considerable lower proportion of countries with low life expectency in the group of countries that have higher life expectancies. It is a flatter, less skewed distribution. You can see how the facet_grid() expression is telling R to create the histogram of the variable mentioned in the ggplot function for the groups defined by the categorical input of interest (the factor “gdp_factor”).

Instead of using facets, we could overlay the histograms with a bit of transparency. Transparencies work better in screens than in printed document, so keep in mind this when deciding whether to use them instead of facets. The code is as follows:

ggplot(gapminder_2007, aes(x = lifeExp, fill = gdp_factor)) + #"fill" identifies the factor grouping the cases
  geom_histogram(position = "identity", alpha = 0.4) #"position = identity"" tells R to overlay the distributions and "alpha"" asks for the degree of transparency, a lower value will be more transparent

Density plots

For smoother distributions, you can use density plot. You should have a healthy amount of data to use these or you could end up with a lot of unwanted noise. Let’s first look at the single density plot for all cases:

ggplot(gapminder_2007, aes(x = lifeExp)) + geom_density()

In a density plot the area under the lines sum to 1 and the Y, vertical, axis now gives you the estimated probability for the values in the X, horizontal, axis. As you can observe it provides a smoother representation of the distribution (as compared to the histograms).

You can also use this to compare the distribution of a quantitative variable across the levels in a categorical variable (factor):

# We are mapping 'gdp_factor' as the variable colouring the lines
ggplot(gapminder_2007, aes(x = lifeExp, colour = gdp_factor)) + geom_density()

Or you could use transparencies:

ggplot(gapminder_2007, aes(x = lifeExp, fill = gdp_factor)) + geom_density(alpha = 0.3)

Did you notice the difference with the comparative histograms? By using density plots we are rescaling to ensure the same area for each of the levels in our grouping variable. This makes it easier to compare two groups that have different frequencies. The areas under the curve add up to 1 for both of the groups, whereas in the histogram the area within the bars represent the number of cases in each of the groups. If you have many more cases in one group than the other it may be difficult to make comparisons or to clearly see the distribution for the group with fewer cases. So, this is one of the reasons why you may want to use density plots.

Interactivity

While we did not cover interactivity in this session, there are excellent resources available online, so I do recommend some further reading.

Other ways of making your charts and overall presentation of information interactive include building Shiny dashboards.

Also if you have some spatial data, a nice interactive way to present it is to use leaflet package.

There are many many ways to visualise and present data available, and this is a really fun thing to play around with, and one of the major strengths of R. I hope that this introduction has been sufficient to get you interested in learning more, and that you can use R to make fun visualisations of your data!

Resources

Books:

Talks:

Blogs/ other: